Adversarial training is an effective approach to making deep neural networks robust against adversarial attacks. Recently, various adversarial training defenses have been proposed that not only maintain high clean accuracy but also show significant robustness against popular and well-studied adversarial attacks such as PGD. However, high adversarial robustness can also arise if an attack fails to find adversarial gradient directions, a phenomenon known as `gradient masking'. In this work, we analyse the effect of label smoothing on adversarial training as one potential cause of gradient masking. We then develop a guided mechanism to avoid local minima during attack optimization, leading to a novel attack dubbed Guided Projected Gradient Attack (G-PGA). Our attack approach is based on a `match and deceive' loss that finds optimal adversarial directions through guidance from a surrogate model. The modified attack does not require random restarts, a large number of attack iterations, or a search for an optimal step size. Furthermore, the proposed G-PGA is generic, so it can be combined with an ensemble attack strategy, as we demonstrate for the case of Auto-Attack, leading to improvements in efficiency and convergence speed. Beyond being an effective attack, G-PGA can be used as a diagnostic tool to reveal elusive robustness due to gradient masking in adversarial defenses.
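To make the idea concrete, here is a minimal PyTorch sketch of a surrogate-guided PGD step. The exact `match and deceive' objective belongs to the paper, so the guidance signal below (steering toward the surrogate's runner-up class) and all names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def guided_pgd(model, surrogate, x, y, eps=8/255, alpha=2/255, steps=10):
    """Hypothetical guided PGD: a surrogate model supplies a target
    direction to escape masked-gradient local minima (sketch only)."""
    x = x.detach()
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        logits = model(x_adv)
        guide = surrogate(x_adv).detach()
        # Assumed guidance: push toward the surrogate's runner-up class
        # while pushing away from the true label.
        target = guide.argsort(dim=1)[:, -2]
        loss = F.cross_entropy(logits, y) - F.cross_entropy(logits, target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # gradient ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                      # valid image range
    return x_adv
```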
Objective: Despite the numerous studies proposed for audio restoration in the literature, most of them focus on an isolated restoration problem such as denoising or dereverberation, ignoring other artifacts. Moreover, assuming a noisy or reverberant environment with a limited number of fixed signal-to-distortion ratio (SDR) levels is common practice. However, real-world audio is often corrupted by a blend of artifacts such as reverberation, sensor noise, and background audio mixture with varying types, severities, and durations. In this study, we propose a novel approach for the blind restoration of real-world audio signals using Operational Generative Adversarial Networks (Op-GANs) with temporal and spectral objective metrics to enhance the quality of the restored audio signal regardless of the type and severity of each artifact corrupting it. Methods: 1D Operational GANs are used with a generative neuron model optimized for the blind restoration of any corrupted audio signal. Results: The proposed approach has been evaluated extensively over the benchmark TIMIT-RAR (speech) and GTZAN-RAR (non-speech) datasets, corrupted with a random blend of artifacts, each with a random severity, to mimic real-world audio signals. Average SDR improvements of over 7.2 dB and 4.9 dB are achieved, respectively, which are substantial when compared with the baseline methods. Significance: This is a pioneering study in blind audio restoration with the unique capability of direct (time-domain) restoration of real-world audio, achieving an unprecedented level of performance for a wide SDR range and a variety of artifact types. Conclusion: 1D Op-GANs can achieve robust and computationally efficient real-world audio restoration with significantly improved performance. The source code and the generated real-world audio datasets are shared publicly with the research community in a dedicated GitHub repository.
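As a rough illustration of the operational-layer idea behind Op-GANs, the PyTorch sketch below implements a 1D self-organized operational layer in which generative neurons approximate a polynomial (Taylor-series) nodal operator with Q parallel convolutions over element-wise powers of the input. The class name and hyper-parameters are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SelfONN1d(nn.Module):
    """Sketch of a 1D self-organized operational layer: each generative
    neuron applies a learnable polynomial of the input, realized here as
    Q convolutions over x**1 .. x**Q whose outputs are summed."""
    def __init__(self, in_ch, out_ch, kernel_size, q=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(q)
        )

    def forward(self, x):
        # x is expected in [-1, 1] (e.g., tanh-scaled waveform) so the
        # element-wise powers x**k stay bounded.
        return sum(conv(x ** (k + 1)) for k, conv in enumerate(self.convs))

# Toy usage on a 1-second, 16 kHz mono waveform.
layer = SelfONN1d(in_ch=1, out_ch=16, kernel_size=9, q=3)
wave = torch.randn(1, 1, 16000).tanh()
print(layer(wave).shape)  # torch.Size([1, 16, 16000])
```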
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practices and the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development, and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based; of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Although existing semi-supervised learning models achieve remarkable success in learning with unannotated in-distribution data, they mostly fail to learn on unlabeled data sampled from novel semantic classes due to their closed-set assumption. In this work, we target a pragmatic but under-explored Generalized Novel Category Discovery (GNCD) setting. The GNCD setting aims to categorize unlabeled training data coming from known and novel classes by leveraging the information of partially labeled known classes. We propose a two-stage Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem. Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts. First, we propose a discriminative prompt regularization loss to reinforce the semantic discriminativeness of the prompt-adapted pre-trained vision transformer for refined affinity relationships. In addition, we propose a contrastive affinity learning stage to calibrate semantic representations based on our iterative semi-supervised affinity graph generation method for semantically enhanced prompt supervision. Extensive experimental evaluation demonstrates that our PromptCAL method is more effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on generic and fine-grained benchmarks (with nearly $11\%$ gain on CUB-200 and $9\%$ on ImageNet-100) in overall accuracy.
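A heavily simplified sketch of the contrastive-affinity idea follows: cosine affinities above a threshold act as pseudo-positive pairs in a contrastive objective. PromptCAL's actual iterative affinity-graph generation is more involved, so the threshold rule, function name, and constants below are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def affinity_contrastive_loss(feats, thresh=0.7, temp=0.1):
    """Toy contrastive objective where high-affinity pairs (cosine
    similarity > thresh) are treated as pseudo-positives."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / temp
    with torch.no_grad():
        pos = ((z @ z.t()) > thresh).float()   # pseudo-positive mask
        pos.fill_diagonal_(0)                  # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # mask self
    log_prob = F.log_softmax(logits, dim=1)
    denom = pos.sum(1).clamp(min=1)            # avoid division by zero
    return -(pos * log_prob).sum(1).div(denom).mean()

loss = affinity_contrastive_loss(torch.randn(32, 128))  # toy batch
```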
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks. Within transformer models, the self-attention mechanism is one of the main building blocks that strives to capture long-range dependencies, compared to the local convolution-based design. However, the self-attention operation has quadratic complexity, which proves to be a computational bottleneck, especially in volumetric medical imaging, where the inputs are 3D with numerous slices. In this paper, we propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters and compute cost. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features using a pair of inter-dependent branches based on spatial and channel attention. Our spatial attention formulation is efficient, having linear complexity with respect to the input sequence length. To enable communication between the spatial and channel-focused branches, we share the weights of the query and key mapping functions, which provides a complementary benefit (paired attention) while also reducing the overall network parameters. Our extensive evaluations on three benchmarks, Synapse, BTCV, and ACDC, reveal the effectiveness of the proposed contributions in terms of both efficiency and accuracy. On the Synapse dataset, our UNETR++ sets a new state-of-the-art with a Dice Similarity Score of 87.2%, while being significantly more efficient, with a reduction of over 71% in both parameters and FLOPs compared to the best existing method in the literature. Code: https://github.com/Amshaker/unetr_plus_plus.
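The paired-attention idea can be sketched as follows. This is an assumed simplification (average pooling stands in for UNETR++'s learned key/value projections, and the scaling factors replace learned temperatures): a shared query/key projection feeds a spatial branch that attends over p pooled tokens, so its cost is linear in sequence length N, alongside a C×C channel branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairedAttention(nn.Module):
    """Sketch of an EPA-style block: spatial and channel branches share
    one query and one key projection; the spatial branch pools keys and
    values down to p tokens for linear cost in N."""
    def __init__(self, dim, pooled_tokens=64):
        super().__init__()
        self.qk = nn.Linear(dim, 2 * dim)   # shared Q/K for both branches
        self.v_sp = nn.Linear(dim, dim)     # spatial-branch values
        self.v_ch = nn.Linear(dim, dim)     # channel-branch values
        self.p = pooled_tokens
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):                   # x: (B, N, C)
        q, k = self.qk(x).chunk(2, dim=-1)
        # Spatial branch: attend over p pooled tokens -> O(N * p).
        k_sp = F.adaptive_avg_pool1d(k.transpose(1, 2), self.p).transpose(1, 2)
        v_sp = F.adaptive_avg_pool1d(self.v_sp(x).transpose(1, 2), self.p).transpose(1, 2)
        attn_sp = (q @ k_sp.transpose(1, 2) / k.shape[-1] ** 0.5).softmax(-1)
        x_sp = attn_sp @ v_sp               # (B, N, C)
        # Channel branch: C x C attention over transposed features.
        attn_ch = (q.transpose(1, 2) @ k / q.shape[1] ** 0.5).softmax(-1)
        x_ch = (attn_ch @ self.v_ch(x).transpose(1, 2)).transpose(1, 2)
        return self.out(torch.cat([x_sp, x_ch], dim=-1))

block = PairedAttention(dim=96)
tokens = torch.randn(2, 1024, 96)           # e.g., flattened 3D patches
print(block(tokens).shape)                  # torch.Size([2, 1024, 96])
```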
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack generalization. This begs the following question: how can image-level CLIP representations be effectively transferred to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps to implicitly model temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects, and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a `bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt the CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot, and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
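The described baseline is simple enough to sketch directly. Assuming a CLIP model exposing the standard encode_image/encode_text interface (as in the OpenAI reference implementation), frame features are temporally average-pooled into one video embedding before similarity matching; the function name and fixed logit scale are illustrative.

```python
import torch

def vifi_clip_logits(clip_model, video, text_tokens, logit_scale=100.0):
    """Sketch of frame-pooling CLIP inference for video: encode each
    frame, average over time, then match against class text embeddings.
    clip_model is assumed to come from e.g. clip.load('ViT-B/16')."""
    b, t, c, h, w = video.shape
    frame_feats = clip_model.encode_image(video.reshape(b * t, c, h, w))
    video_feats = frame_feats.reshape(b, t, -1).mean(dim=1)  # temporal pool
    video_emb = video_feats / video_feats.norm(dim=-1, keepdim=True)
    text_emb = clip_model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return logit_scale * video_emb @ text_emb.t()            # (b, classes)
```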
We consider the multi-user detection (MUD) problem in uplink grant-free non-orthogonal multiple access (NOMA), where the access point has to determine the number of active Internet of Things (IoT) devices, their identities, and the data they transmitted. We assume that the IoT devices use complex spreading sequences and transmit information in a random-access fashion, following a bursty activation pattern in which some IoT devices transmit their data in multiple adjacent time slots with high probability, while others transmit only once within a frame. Exploiting this temporal correlation, we propose an attention-based bidirectional long short-term memory (BiLSTM) network to solve the MUD problem. The BiLSTM network creates a pattern of the device activation history using forward- and reverse-pass LSTMs, while the attention mechanism provides essential context for the device activation points. In this way, a hierarchical pathway is followed to detect active devices in a grant-free scenario. Blind data detection for the estimated active devices is then performed by exploiting the complex spreading sequences. The proposed framework requires no prior knowledge of the device sparsity level or of the channels in order to perform MUD. The results show that the proposed network performs better than existing benchmark schemes.
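A minimal sketch of such an attention-based BiLSTM detector is shown below, assuming per-slot received-signal features as input and a per-slot multi-label activity head; the dimensions, attention form, and class name are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """Toy attention-based BiLSTM for slot-wise device-activity
    detection: the BiLSTM summarizes activation history in both
    directions; attention supplies temporal context per slot."""
    def __init__(self, in_dim, hidden=128, n_devices=100):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_devices)

    def forward(self, y):                       # y: (B, slots, in_dim)
        h, _ = self.bilstm(y)                   # (B, slots, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention over time slots
        ctx = (w * h).sum(dim=1, keepdim=True)  # temporal context vector
        return torch.sigmoid(self.head(h + ctx))  # per-slot activity probs

net = AttentiveBiLSTM(in_dim=64)                # e.g., real/imag signal features
probs = net(torch.randn(2, 10, 64))             # 10 time slots per frame
print(probs.shape)                              # torch.Size([2, 10, 100])
```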
Activities in the operating room (OR) often differ from those in other typical work environments. In particular, surgeons are frequently subject to multiple psycho-organizational constraints that may negatively affect their health and performance. This is often attributed to an increase in the associated cognitive workload (CWL), which results from dealing with unexpected and repetitive tasks, large volumes of information, and the cognitive overload caused by potential risks. In this paper, two machine learning approaches are proposed for the multimodal recognition of CWL in four different surgical tasks. First, a model based on the concept of transfer learning is used to determine whether a surgeon is experiencing any CWL. Second, a convolutional neural network (CNN) uses this information to identify the type of CWL associated with each surgical task. The proposed multimodal approach considers adjacent signals from electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), and pupil eye diameter. The concatenation of the signals allows complex correlations to be captured in terms of both time (temporal) and channel location (spatial). Data collection was performed by a multi-sensing AI environment for a surgical task $\&$ role optimization platform (MAESTRO) developed in the HARMS Lab. To compare the performance of the proposed approaches, a number of state-of-the-art machine learning techniques have been implemented. The tests show that the proposed model achieves an accuracy of 93%.
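One plausible reading of the signal-concatenation step is sketched below, under assumed channel counts: the resampled EEG, fNIRS, and pupil-diameter streams are stacked as input channels so that 1D convolutions can capture temporal and cross-channel (spatial) correlations jointly before classifying the CWL type. All layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultimodalCWLNet(nn.Module):
    """Toy multimodal CWL classifier: modality streams resampled to a
    common length are concatenated channel-wise and fed to a 1D CNN."""
    def __init__(self, eeg_ch=8, fnirs_ch=4, n_classes=4):
        super().__init__()
        in_ch = eeg_ch + fnirs_ch + 1           # +1 pupil-diameter channel
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 32, 7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, eeg, fnirs, pupil):       # each: (B, ch, length)
        x = torch.cat([eeg, fnirs, pupil], dim=1)
        return self.net(x)

model = MultimodalCWLNet()
out = model(torch.randn(2, 8, 256), torch.randn(2, 4, 256), torch.randn(2, 1, 256))
print(out.shape)  # torch.Size([2, 4])
```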
Nowadays, more and more surgical procedures are performed using minimally invasive surgery (MIS). This is due to its many benefits, such as minimal post-operative complications, less bleeding, smaller scars, and fast recovery. However, the restricted field of view, the small operating space, and the indirect view of the operating scene in MIS can cause surgical tools to collide and potentially damage human organs or tissues. Therefore, by using endoscopic video feeds to detect and monitor surgical instruments in real time, MIS problems can be greatly reduced and the accuracy and success rate of surgical procedures can be improved. In this paper, a series of improvements to the YOLOv5 object detector are investigated, analyzed, and evaluated to enhance the detection of surgical instruments. In this process, we conducted a performance-based ablation study, explored the impact of changing the backbone, neck, and anchor structural elements of the YOLOv5 model, and annotated a unique endoscopic dataset. Furthermore, we compared the effectiveness of our ablation study against four other SOTA object detectors (YOLOv7, YOLOR, Scaled-YOLOv4, and YOLOv3-SPP). Apart from YOLOv3-SPP, which showed comparable model performance (98.3% mAP) and similar inference speed, all of our benchmarked models, including the original YOLOv5, were outperformed by our top refined model in experiments on the new endoscopic dataset.
Over the past decade, deep learning-based algorithms have become widely popular in different areas of remote sensing image analysis. Recently, transformer-based architectures, originally introduced in natural language processing, have pervaded the field of computer vision, where the self-attention mechanism has been adopted as a replacement for the popular convolution operator for capturing long-range dependencies. Inspired by these recent advances in computer vision, the remote sensing community has also witnessed an exploration of vision transformers for a diverse set of tasks. Although a number of surveys have focused on transformers in computer vision, to the best of our knowledge we are the first to present a systematic review of recent transformer-based advances in remote sensing. Our survey covers more than 60 transformer-based methods for different remote sensing problems in the sub-areas of remote sensing: very high-resolution (VHR), hyperspectral (HSI), and synthetic aperture radar (SAR) imagery. We conclude the survey by discussing the different challenges and open problems of transformers in remote sensing. In addition, we intend to frequently update and maintain the latest transformer-based papers in remote sensing, together with their respective code, at: https://github.com/virobo-15/transformer-in-remote-sensing